Reduce RAM usage, fix VRAM OOMs, and fix Windows shared memory spilling with adaptive model loading #11845
As a 16GB VRAM user running the LTX 2 model, the main issue for me currently is that before VAE decoding occurs, the whole model gets offloaded to RAM, which is already loaded with the TEs/VAE/latent upscalers, etc., so it gets overloaded and spills onto the pagefile. In reality the max VRAM use of the VAE decoding step is 4GB with the "VAE Decode (Tiled)" node, so unloading the whole model (probably because ComfyUI's VAE estimations don't account for tiled decoding) is the biggest issue I have found. It is the only reason us 16GB VRAM / 32GB RAM users see the TE reload from disk (because it got unloaded to make space for the model) after changing the prompt, and the model reload again; both of these contribute to huge slowdowns. |
This loader doesn't unload back to RAM at all, so it won't spill to the pagefile. The idea is: if you don't have enough RAM, just dump it, because it's faster to read it from the file on disk again than to write to and read from the pagefile. If you do have enough RAM, the OS will just leave the model in the disk cache from the first load. So this should be faster for you. Your margins are very low, so you might do well with --disable-pinned-memory, but if you try it, try it both ways. Kudos for running LTX2 on 16GB; these performance points are what I'm really trying to make work here. |
Loading the model file from disk again does seem to be a nice way to prevent useless writing to the pagefile, at least. It will be very useful for 16GB RAM users. I use the model without any changes to startup args, with GGUFs. It works perfectly; the only issue currently is offloading the whole model to RAM and pagefile just to make space for 4GB haha |
|
The ComfyUI-ReservedVRAM node already lets ComfyUI run any model in any amount of VRAM. LTX2 can run in 1G of VRAM. Maybe you should take a look. |
|
Awesome, I hope this gets implemented |
I am looking at the screenshots of this node: where does the model offload to? The RAM use doesn't increase even with the model only using 6GB of VRAM, so what's the caveat? Edit: NVM, it offloads to RAM |
This PR doesn't do any offloading to RAM like the node you mentioned. This PR just drops the model if enough RAM space is not found and loads the files for each run (I think) using a faster method. TL;DR: it prevents useless writing to the pagefile |
|
1. Large RAM savings. In your pic, it's saving about 1/2 😲 |
|
I tested some results in Qwen Edit using GGUF. Speed is the same, RAM almost the same, VRAM seems more stable than before. LTX2 GGUF is very bad: RAM use is higher, speed is 1/4, VRAM is full, and reserve-vram is useless now. Not good |
|
@zwukong note, gguf is not officially supported in ComfyUI and requires the use of a custom node pack, which at this moment does not account for anything changed in this PR since it's so new. For testing, please don't use gguf models at this time! |
That is practically similar to what I usually do: I use normalvram to minimize both RAM & VRAM usage. Unlike lowvram, which forcefully stores the model in RAM after use, or highvram, which forcefully keeps the model in VRAM after use, normalvram will free the model after use when there is not enough free memory, thus avoiding swap file usage. |
|
GGUF should be your priority, I think. Only 10% have a 4090/5090; 90% use GGUF with under 20G of VRAM. Even kj uses GGUF too, and he has a 5090 |
Don't think this flag does anything these days: ^^ No users in code search. lowvram semantics have been the default for a while, as models got too big to assume that users could fit them in VRAM by default. |
I certainly do not use GGUF on a 5090 or even a 4090; that would be missing out on fp8 matmuls (the fast mode). The common misconception seems to be that you have to use GGUF for low VRAM systems, which isn't true when offloading exists and is rather effective in ComfyUI currently. This PR would make it work even better. Of course, on low RAM systems there are fewer options for offloading and GGUF becomes useful. |
|
Yes, mainly for low RAM, and it's less than 1/4 the size. FP8 or FP16 eats too much RAM, at least twice as much. So if your text encoder and unet are both fp8, that's four times the size. And right now RAM and SSDs are more expensive than video cards 😄 |
|
comfy_aimdo seems to break ROCm even when this is not in use. It tries to dynamically load libcuda.so.1, which does not exist. |
|
Are there plans to open-source comfy-aimdo? |
|
comfy-aimdo will be open sourced before this pull request is merged. For gguf we will try not to break it but we are focusing on improving our own native quant system to make it better/faster than gguf. |
Any plans of converting existing models to NVFP4? I tried converting FLUX SRPO and managed it, but quality dropped sharply |
|
@comfyanonymous I know it is fp4, but 40-series cards can't run it. We need INT4 as well, like Nunchaku does. The reason GGUF is the best for now is quality and size; even Q2 can get pretty good results, and Q3 and above are almost the same as fp16 |
|
There is a PR for int4. This PR is for memory management improvements with comfy-aimdo; it would be appreciated if this thread were kept for testing this PR and not feature requests |
|
Not just feature requests: when I tested this PR, GGUF could not benefit. So I want GGUF to be supported too. Most of us use the great GGUFs. Almost all my models (about 99%) are GGUF |
|
Would the Windows shared memory avoidance stuff have any effect when using WSL? If not: with the changes now maximising VRAM usage, I've noticed some slowdowns/stalls on subsequent runs, along with the normal signs of shared memory sluggishness (low temperatures, low power draw, 100% GPU usage, and reported shared memory usage in Task Manager) when testing out the PR, and I wonder if the changes might not be suited for WSL as-is. |
By default, Windows will only allow 1/2 of your system memory to be used by WSL (without modifying .wslconfig with memory=24GB or whatever you want to set it to), so you're already going to run into issues quickly; see the snippet below. But as far as I know, ComfyUI should pick up on that value, and if it does, then any other memory management math should pick up on it as well. Though at the GPU driver level, I'm not sure how they handle shared memory when used with WSL. |
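For reference, raising the WSL memory cap as described above would look roughly like this in %UserProfile%\.wslconfig (the 24GB figure is just the example value from the comment; pick whatever fits your system):

```ini
# %UserProfile%\.wslconfig -- example only
[wsl2]
memory=24GB
```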
You are right that I ignore --reserve-vram for the moment. It can be implemented with a bit of plumbing and I'll take it as a feature request (along with --novram), but we might not do that one in V1 as you can just opt out in the interim.

Yeah, so WSL is actually a big problem and very difficult (maybe impossible) to fix with regards to shared memory spilling. When you are under WSL you will present as Linux to aimdo, which won't have its anti-spill in play, since that is Windows specific. Even if we could detect WSL, we would not have access to the APIs needed to detect the spill, as they are only visible on the host Windows. WSL has value from a Linux familiarity point of view and solves some software packaging problems, but unfortunately the extra layer of indirection between comfy and the GPU creates multiple performance problems. If you're optimizing comfy performance and like the Linux env, I VERY strongly recommend a dual boot setup, as I have observed major performance differences in offloading setups where Linux just beats Windows with all other variables held the same (I dual boot my day-to-day test machine between Ubuntu and Win11). |
Sync before deleting anything.
This is needed for aimdo, where the cache can't self-recover from fragmentation. It is, however, a good thing to do anyway after an OOM, so make it unconditional.
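A minimal sketch of that ordering (illustrative only, not the PR's actual code): synchronize first, then drop references, then release the allocator cache.

```python
import torch

def free_after_oom(loaded_models: list):
    # Drain any queued kernels that may still reference the weights we are
    # about to drop, before anything is deleted.
    torch.cuda.synchronize()
    loaded_models.clear()      # drop the references so the VRAM can actually be freed
    torch.cuda.empty_cache()   # hand the cached blocks back to the driver
```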
Be more tolerant of unsupported platforms and fall back properly. Fixes a crash when cuda is not installed at all.
If running on Windows, defer creation of the layer parameters until the state dict is loaded. This avoids a massive spike in Windows commit charge when a model is created but not loaded. This problem doesn't exist on Linux, as Linux allows RAM overcommit; Windows does not. Before the dynamic memory work this was also a non-issue, as every non-quant model would immediately RAM load and need the memory anyway. Make the workaround Windows specific, as there may be someone out there with some train-from-scratch workflow (which this might break), and assume said someone is on Linux.
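As a rough illustration of the deferred-parameter idea (an assumed sketch, not the PR's code): building the layer on the meta device costs no RAM or commit charge, and the real tensors only appear when the state dict is assigned in.

```python
import torch
import torch.nn as nn

def make_deferred_linear(in_features: int, out_features: int, defer: bool = True):
    if defer:
        # Meta tensors carry only shape/dtype metadata, so no commit charge is
        # taken for the placeholder weight at construction time.
        with torch.device("meta"):
            return nn.Linear(in_features, out_features, bias=False)
    return nn.Linear(in_features, out_features, bias=False)

layer = make_deferred_linear(4096, 4096)
# Later, when the real weights arrive (e.g. from an mmap'd file), assign them in;
# assign=True materializes the parameter without an extra copy.
state_dict = {"weight": torch.zeros(4096, 4096, dtype=torch.float16)}
layer.load_state_dict(state_dict, assign=True)
```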
The CoW mmap as used by safetensors is hardcoded to CoW, which forcibly consumes Windows commit charge on a zero copy. RIP. Implement safetensors loading in pytorch itself with a READ mmap so we don't get commit-charged for all our open models.
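A hedged sketch of the READ-mmap load (names and the dtype table are illustrative; the real implementation lives in comfy/utils.py): the whole file is mapped read-only, so Windows never takes a commit charge for it, and tensors are just views into the mapping.

```python
import json
import mmap
import struct

import torch

def load_safetensors_readonly(path):
    # Standard safetensors layout: 8-byte little-endian header length,
    # JSON header, then the raw tensor data area (assumed to be aligned as usual).
    with open(path, "rb") as f:
        mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # READ, not CoW
    header_size = struct.unpack("<Q", mapping[:8])[0]
    header = json.loads(mapping[8:8 + header_size])
    # This is what triggers the "buffer is not writable" warning seen in the logs.
    data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]

    dtypes = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16}
    tensors = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue
        start, end = info["data_offsets"]
        # Views into the mapping: nothing is copied, nothing is committed.
        tensors[name] = data_area[start:end].view(dtypes[info["dtype"]]).view(info["shape"])
    return tensors
```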
This isn't worth it, and the likelihood of inference leaving a complex data structure with a cyclic reference behind is low. Remove it. We would replace it with a condition on nodes that actually touch the GPU, which might be a win.
This is phase 2
This is needed for deepcopy construction. We shouldn't really have deep copies of MP or MODynamic; however, this is a stray one in some controlnet flows.
Force-pushed from 549ce2d to aef8d00
|
rebased to 0fc1570 (v0.10.0 +9) |
Now that the model-defined dtype is decoupled from the state_dict dtypes, we need to be able to handle worst-case casts between the SD and the VBAR.
Scan created models and save off the dtypes as defined by the model creation process. This is needed for assign=True, which will override the dtypes.
If the model defines a dtype that is different to what is in the state dict, respect that at load time. This is done as part of the casting process.
Thanks for the test. Can you confirm the version of the PR you tried (just type "git show")? I'm making changes every day as things come in, and if I can associate this data with a specific revision that helps. Can I get your PCIe bus width and generation? I am very, very interested in your data if you do exactly the same setup with --disable-pinned-memory, both for your memory numbers and execution times.

The longer story: your RAM consumption as reported by nvtop is usually nothing to worry about, as it's measuring utilization as opposed to committed memory. Committed memory exhaustion is the one that OOMs and crashes systems. Open Task Manager and have a look at the memory page and you will see the "Committed" number. This should be lower with the PR. In this PR the model remains in RAM, but as a soft uncommitted allocation which Windows will automatically free if the system comes under RAM pressure (i.e. it's not committed). Because you just load and use the same big model 4 times, this just flatlines on the peak, which is fine. The pinned memory is, however, fully committed and a separate allocation. So if you have the RAM space it will keep around both the pinned copy and the original copy of the model, and nvitop will count both. |
|
@rattus128 On the previous test, it was on commit 96e5d45. Anyway, here is a new test run with the latest changes, 2d96b2f. Sorry if it's messy lol.
1. Without dynamic_vram
Logs
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : cudaMallocAsync
Using async weight offloading with 2 streams
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static
Import times for custom nodes:
0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials
Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.459s (created=0, skipped_existing=1689, total_seen=1692)
Starting server
To see the GUI go to: http://127.0.0.1:8188
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load AutoencoderKL
loaded completely; 2963.00 MB usable, 160.31 MB loaded, full load: True
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded partially; 5631.80 MB usable, 5294.59 MB loaded, 2968.76 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
loaded partially; 5465.60 MB usable, 5129.60 MB loaded, 3133.51 MB offloaded, 336.00 MB buffer reserved, lowvram patches: 0
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
loaded partially; 1090.13 MB usable, 226.02 MB loaded, 17090.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00, 4.15s/it]
Requested to load AutoencoderKL
loaded completely; 1100.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 50.00 seconds
got prompt
loaded partially; 1088.13 MB usable, 224.02 MB loaded, 17092.00 MB offloaded, 864.00 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:17<00:00, 4.26s/it]
Requested to load AutoencoderKL
loaded completely; 1100.17 MB usable, 160.31 MB loaded, full load: True
Prompt executed in 23.51 seconds
Peak Committed Memory
Second run
2. With dynamic_vram & pinned memory enabled.
Logs
Found comfy_kitchen backend cuda: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Found comfy_kitchen backend triton: {'available': True, 'disabled': True, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8']}
Found comfy_kitchen backend eager: {'available': True, 'disabled': False, 'unavailable_reason': None, 'capabilities': ['apply_rope', 'apply_rope1', 'dequantize_nvfp4', 'dequantize_per_tensor_fp8', 'quantize_nvfp4', 'quantize_per_tensor_fp8', 'scaled_mm_nvfp4']}
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0+cu130
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : native
Using async weight offloading with 2 streams
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
DynamicVRAM support detected and enabled
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.10.0
ComfyUI frontend version: 1.37.11
[Prompt Server] web root: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static
Import times for custom nodes:
0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\websocket_image_save.py
0.0 seconds: F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\custom_nodes\ComfyUI_essentials
Context impl SQLiteImpl.
Will assume non-transactional DDL.
Assets scan(roots=['models']) completed in 0.433s (created=0, skipped_existing=1689, total_seen=1692)
Starting server
To see the GUI go to: http://127.0.0.1:8188
got prompt
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\comfy\utils.py:94: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\torch\csrc\utils\tensor_new.cpp:1587.)
data_area = torch.frombuffer(mapping, dtype=torch.uint8)[8 + header_size:]
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load AutoencoderKL
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Found quantization metadata version 1
Using MixedPrecisionOps for text encoder
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
F:\AI\ComfyUI-Nightly-AlwaysUpToDate\ComfyUI\venv\Lib\site-packages\torch\nn\functional.py:2954: UserWarning: Mismatch dtype between input and weight: input dtype = float, weight dtype = struct c10::BFloat16, Cannot dispatch to fused implementation. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\layer_norm.cpp:347.)
return torch.rms_norm(input, normalized_shape, weight, eps)
0 models unloaded.
Model Flux2TEModel_ prepared for dynamic VRAM loading. 8263MB Staged. 0 patches attached.
model weight dtype torch.float16, manual cast: None
model_type FLUX
Requested to load Flux2
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:25<00:00, 6.27s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 60.80 seconds
got prompt
0 models unloaded.
Model Flux2 prepared for dynamic VRAM loading. 17316MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████| 4/4 [00:27<00:00, 6.82s/it]
0 models unloaded.
Model AutoencoderKL prepared for dynamic VRAM loading. 320MB Staged. 0 patches attached.
Prompt executed in 26.64 seconds
The peak committed memory and the second run are similar, both stayed at around 47 GB.
3. With dynamic_vram &
|
Test Evidence Check |
Not sure if there's a better place to raise this but this bot has commented on most (every?) PR with this message even if there are no issues in the description, meaning it's quite a lot of noise (an email from GitHub per subscribed PR). |
Thanks for this data. It's definitely worth looking into and I am tracking it here: rattus128#2 I'll look into it when I get a chance. I have a system pretty similar to yours (3060 + 64GB), so hopefully I can cleanly reproduce it. |














To try it:
NOTE: This work does not have any GGUF integration and GGUF will not see any benefits yet.
NOTE: I am aware of increased Windows RAM usage when not configuring a pagefile, due to commit quota exhaustion. If anyone is testing, please stay tuned for a major fix to Windows RAM usage incoming. The VRAM stuff on Windows is still testable. Linux is unaffected. (FIXED)
If you try it, please reply to the PR (if it hasn't been merged) with any issues, or feel free to make an issue ticket for bigger test cases with logs and numbers.
Features
A new ModelPatcher implementation that backs onto comfy-aimdo to implement varying model load levels that can be adjusted during model use. The patcher defers all load processes, lazily loading the model during use (e.g. the first step of a ksampler), and automatically negotiates a load level during inference to maximize VRAM usage without OOMing. If inference requires more VRAM than is available, weights are offloaded to make space before the OOM happens.
This will eventually allow for development of ComfyUI without needing to estimate model VRAM usage at all.
Large RAM and Windows commit-charge savings. There is no need to load models fully to RAM. This also gives a much higher chance of having a model in the disk cache, saving the user from a disk load delay on first run, as there is no longer a primary load into process memory displacing the disk cache.
Windows GPU shared memory usage avoidance
A deep copy of the model is eliminated in the safetensors save process (incidental improvement)
Reduced VRAM usage in the async offload stream when cuda malloc is disabled (prerequisite improvement)
Implementation Details
Aimdo readme here: https://pypi.org/project/comfy-aimdo/
The long story on RAM: Aimdo's ability to just evict weights means it's no longer possible to .to() a weight back and forth from the GPU. VRAM pressure can occur at any time during inference, and there is no clean way to .to() weights or modules back to the CPU while pytorch is stacked in the middle of a pending VRAM allocation. So, since we can never .to() a weight, we instead take the opportunity to leave the model parameter as known to pytorch on the CPU permanently, with assign=True state dict loading. Since it is never write-touched, it lives in mmap permanently and never consumes any process-allocated RAM. Several community developers have already flagged this as a possible major enhancement to comfy, and the needed changes to model load and unload align with the VRAM problems.
(NEW) Windows has extra RAM complications with its pessimistic allocation and how it forbids overcommit other than via the pagefile. Two changes are made to drastically reduce commit charge. Linear nn.Modules are now constructed without the placeholder weight, as this consumes commit charge. The other change is a lightweight safetensors load that maps files in READ mode (the safetensors package uses CoW), which avoids getting commit-charged for the whole model on file load.
As for loading the weight onto the GPU, that happens via comfy_cast_weights, which is now used in all cases. cast_bias_weight checks whether the VBAR assigned to the model has space for the weight (based on the same load-priority semantics as the original ModelPatcher). If it does, the VRAM returned by the Aimdo allocator is used as the parameter GPU-side. The caster is responsible for populating the weight data. This is done using the usual offload_stream (which means we now have asynchronous loads overlapping first-use compute).
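A minimal sketch of that async cast path under assumed names (offload_stream here is just a plain side stream, not the PR's object): the H2D copy is issued on the side stream and the compute stream waits on it before first use.

```python
import torch

offload_stream = torch.cuda.Stream()  # side stream dedicated to weight transfers

def cast_weight_async(cpu_weight: torch.Tensor) -> torch.Tensor:
    # True overlap needs a pinned source; mmap-backed weights would be staged
    # through a pinned cast buffer in a real implementation.
    with torch.cuda.stream(offload_stream):
        gpu_weight = cpu_weight.to("cuda", non_blocking=True)
    # The compute stream must not touch the weight before the copy lands.
    torch.cuda.current_stream().wait_stream(offload_stream)
    gpu_weight.record_stream(torch.cuda.current_stream())
    return gpu_weight
```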
Pinning works a little differently. When a weight is detected during load as unable to fit, a pin is allocated at the time of casting, and the weight as used by the layer is DMA'd back to the pin using the GPU DMA TX engine, also on the asynchronous offload streams. This means you get to pin the Lora-modified and requantized weights, which can be a major speedup for offload+quantize+lora use cases. This works around the JIT Lora + FP8 exclusion and brings FP8MM to heavy offloading users (who probably really need it with more modest GPUs). There is a performance risk in that a CPU+RAM patch has been replaced with a GPU+RAM patch, but my initial performance results look good. Most users are likely to have a GPU that outruns their CPU in these woods.
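Roughly, the pin-on-the-way-down idea looks like this (a sketch under assumed names, not the PR's helper): the already patched/requantized GPU weight is DMA'd into a freshly allocated pinned host buffer on the offload stream.

```python
import torch

offload_stream = torch.cuda.Stream()

def pin_patched_weight(gpu_weight: torch.Tensor) -> torch.Tensor:
    # Page-locked host buffer: the GPU copy (DMA TX) engine can write into it
    # while compute keeps running on the default stream.
    pin = torch.empty(gpu_weight.shape, dtype=gpu_weight.dtype,
                      device="cpu", pin_memory=True)
    with torch.cuda.stream(offload_stream):
        pin.copy_(gpu_weight, non_blocking=True)
    # The caller must keep gpu_weight alive (e.g. sync or event-wait on
    # offload_stream) until the copy completes before reusing its VRAM.
    return pin
```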
Some common code is written to consolidate a layer's tensors for aimdo mapping, pinning, and DMA transfers. interpret_gathered_like() allows unpacking a raw buffer as a set of tensors. This is used consistently to bundle weights, quantization metadata (QuantizedTensor bits), and biases into one payload for DMA in the load process, reducing Cuda overhead a little. Some quantization metadata was missing async offload in some cases, which is now added. This also pins quantization metadata and consolidates the number of cuda_host_register calls (which can be expensive).
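The gathering idea, approximated (interpret_gathered_like() is the PR's helper; this standalone sketch only illustrates the packing): a layer's weight, bias, and quant metadata are packed into one pinned buffer so a single registration/DMA covers all of them, with per-tensor views carved back out.

```python
import torch

def gather_layer_tensors(tensors, align: int = 16):
    # Compute aligned offsets so each slice can be reinterpreted as its dtype.
    offsets, offset = [], 0
    for t in tensors:
        offset = (offset + align - 1) // align * align
        offsets.append(offset)
        offset += t.numel() * t.element_size()

    buf = torch.empty(offset, dtype=torch.uint8, pin_memory=True)  # one pinned allocation
    views = []
    for t, off in zip(tensors, offsets):
        nbytes = t.numel() * t.element_size()
        view = buf[off:off + nbytes].view(t.dtype).view(t.shape)
        view.copy_(t)        # pack the original data into the shared buffer
        views.append(view)
    # One buffer to register/transfer, per-tensor views for the layer to keep using.
    return buf, views
```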
Model saving is reworked to avoid the force_cast_weights flag, which doesn't make sense in ModelPatcherDynamic. This rework was able to cut a RAM copy of the model by doing on-the-fly model patching during the save process, which worked out to be a nice RAM saving while fixing my API problem.
Aimdo (under the hood) links with Windows APIs to adjust load levels based on the WDDM target VRAM usage rather than the numbers reported by the pytorch/Cuda stack (which are WDDM's lies). This means that as soon as shared memory spilling occurs on Windows, weights will be unloaded until you get out of the spill state, and inference state will move back to VRAM.
Offload streams now have an accompanying single shared cast buffer that grows as needed. This avoids significant waste and fragmentation in the cast-buffer streams when offloading multiple weight sizes, as we don't have cuda_malloc and the pytorch allocator completely isolates memory by stream. So we go a little hands-on at the low level to keep those allocation pools minimized. This is also applied to the non --dynamic_vram case when not using cuda_malloc, as it does reduce VRAM, especially on flux2 with those huge and varying weights.
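A sketch of the grow-only cast buffer idea (assumed structure, not the PR's class): one buffer per offload stream that only ever grows to the largest weight seen, with slices handed out for each cast, so the per-stream allocator pool stops accumulating one block per weight size.

```python
import torch

class GrowOnlyCastBuffer:
    """One reusable staging buffer per offload stream."""

    def __init__(self, device: str = "cuda"):
        self.device = device
        self.buf = None

    def get(self, nbytes: int) -> torch.Tensor:
        # Reallocate only when a bigger weight shows up; otherwise hand back a
        # slice of the existing block, avoiding per-size fragmentation.
        if self.buf is None or self.buf.numel() < nbytes:
            self.buf = torch.empty(nbytes, dtype=torch.uint8, device=self.device)
        return self.buf[:nbytes]
```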
Future Work
- The progress meter needs some work. It's jarring to have it stall on the first iteration when it's doing a slow model load. (DONE)
Example Test case:
Flux2 + Lora text to image.
RTX 5090 with 8GB of VRAM consumed by a non-comfy application (24GB effective)
PCIE5 NVME, 96GB RAM.
Disk caches warm with model
Before:
General Memory Usage
Peak VRAM:
After (--fast dynamic_vram)
General Memory usage:
Peak VRAM:
More test data to come. Most workflows I have run are faster with this.
I'm testing various things, updates, bugfixes, etc., but enough works for a PR.